Multimodal People Detection and Tracking in Crowded Scenes
Abstract
This paper presents a novel people detection and tracking method based on a multimodal sensor fusion approach that utilizes 2D laser range and camera data. The data points in the laser scans are clustered using a novel graph-based method, and an SVM-based version of the cascaded AdaBoost classifier is trained on a set of geometrical features of these clusters. In the detection phase, the classified laser data is projected into the camera image to define a region of interest for the vision-based people detector. This detector is a fast version of the Implicit Shape Model (ISM) that learns an appearance codebook of local SIFT descriptors from a set of hand-labeled images of pedestrians and uses them in a voting scheme to vote for centers of detected people. The extension consists in a fast and detailed analysis of the spatial distribution of voters per detected person. Each detected person is tracked using a greedy data association method and multiple Extended Kalman Filters that use different motion models. This way, the filter can cope with a variety of different motion patterns. The tracker is asynchronously updated by the detections from the laser and the camera data. Experiments conducted in real-world outdoor scenarios with crowds of pedestrians demonstrate the usefulness of our approach.

Introduction

The ability to reliably detect people in real-world environments is crucial for a wide variety of applications including video surveillance and intelligent driver assistance systems. According to the National Highway Traffic Safety Administration report (NHTSA 2007), there were 4,784 pedestrian fatalities in the United States during the year 2006, which accounted for 11.6% of the total 42,642 traffic-related fatalities. In countries of Asia and Europe, the percentage of pedestrian accidents is even higher. The number of such accidents could be reduced if cars were equipped with systems that can automatically detect, track, and predict the motion of pedestrians. However, pedestrians are particularly difficult to detect because of their high variability in appearance due to clothing and illumination, and because their shape characteristics depend on the viewpoint. In addition, occlusions caused by carried items such as backpacks, as well as clutter in crowded scenes, can render this task even more complex, because they dramatically change the shape of a pedestrian.

Our goal is to detect pedestrians and localize them in 3D at any point in time. In particular, we want to provide a position and a motion estimate that can be used in a real-time application, e.g. online path planning in crowded environments. The real-time constraint makes this task particularly difficult and requires faster detection and tracking algorithms than the existing approaches. Our work makes a contribution in this direction. The approach we propose is multimodal in the sense that we use 2D laser range data and CCD camera images cooperatively. This has the advantage that both geometrical structure and visual appearance information are available for a more robust detection. In this paper, we exploit this information using supervised learning techniques based on a combination of AdaBoost with Support Vector Machines (SVMs) for the laser data and on an extension of the Implicit Shape Model (ISM) for the vision data.
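As a minimal sketch of the voting idea behind the ISM-based detector (with an illustrative codebook representation, matching threshold, and grid-based maxima search rather than the actual implementation), matched local descriptors can cast votes for person centers as follows:

```python
import numpy as np

def ism_vote_centers(descriptors, keypoints, codebook, offsets,
                     match_thresh=0.3, cell=10, min_votes=5):
    """Simplified ISM-style voting (illustrative sketch, not the actual system).

    descriptors : (N, 128) local descriptors (e.g. SIFT) from the test image
    keypoints   : (N, 2)   image positions of those descriptors
    codebook    : (M, 128) codewords learned from training descriptors
    offsets     : list of M arrays, each (K_i, 2): offsets from codeword
                  occurrences to the annotated person center seen in training
    Returns a list of (x, y, num_votes) person-center hypotheses.
    """
    votes = []
    for desc, kp in zip(descriptors, keypoints):
        # match the descriptor to its nearest codeword
        dists = np.linalg.norm(codebook - desc, axis=1)
        m = int(np.argmin(dists))
        if dists[m] > match_thresh:
            continue
        # every center offset stored with the matched codeword casts one vote
        for off in offsets[m]:
            votes.append(kp + off)
    if not votes:
        return []
    votes = np.asarray(votes)

    # accumulate votes on a coarse grid and keep well-supported cells
    cells = np.floor(votes / cell).astype(int)
    hypotheses = []
    for c in np.unique(cells, axis=0):
        mask = np.all(cells == c, axis=1)
        if mask.sum() >= min_votes:
            cx, cy = votes[mask].mean(axis=0)
            hypotheses.append((float(cx), float(cy), int(mask.sum())))
    return hypotheses
```

The extension mentioned above additionally analyzes the spatial distribution of the voters behind each such hypothesis in order to discard false positives.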
In the detection phase, both classifiers yield likelihoods of detecting people, which are fused into an overall detection probability. Finally, each detected person is tracked using multiple Extended Kalman Filters (EKF) with three different motion models and a greedy data association. This way, the filter can cope with different motion patterns for several persons simultaneously. The tracker is asynchronously updated by the detections from the laser and the camera data.

The major contributions of this work are:
• An improved version of the image-based people detector by Leibe et al. (2005). The improvement consists of two extensions to the ISM that reduce the computation time and make the approach better suited for real-time applications.
• A tracking algorithm based on EKFs with multiple motion models. The filter is asynchronously updated with the detection results from the laser and the camera.
• The integration of our multimodal people detector and the tracker into a robotic system that is employed in a real outdoor environment.

This paper is organized as follows. The next section describes previous work that is relevant for our approach. Then, we give an overview of our overall people detection and tracking system. Section 4 presents our detection method based on the 2D laser range data. Then, we introduce the Implicit Shape Model (ISM) and our extensions to the ISM. Subsequently, we explain our EKF-based tracking algorithm with a focus on the multiple motion models we use. Finally, we describe our experiments and conclusions.

Figure 1: Overview of the individual steps of our system. See text for details.

Previous Work

Several approaches can be found in the literature to identify a person in 2D laser data, including the analysis of local minima (Scheutz, Mcraven, & Cserey 2004; Schulz et al. 2003; Topp & Christensen 2005), geometric rules (Xavier et al. 2005), or a maximum-likelihood estimation to detect dynamic objects (Hähnel et al. 2003). Most similar to our work is the approach of Arras, Mozos, & Burgard (2007), which clusters the laser data and learns an AdaBoost classifier from a set of geometrical features extracted from the clusters. Recently, we extended this approach (Spinello & Siegwart 2008) by using multi-dimensional features and learning them using a cascade of Support Vector Machines (SVMs) instead of the AdaBoost decision stumps. In this paper, we will make use of that work and combine it with an improved appearance-based people detection and an EKF-based tracking algorithm.

In the area of image-based people detection, there mainly exist two kinds of approaches (see Gavrila (1999) for a survey). One uses the analysis of a detection window or templates (Gavrila & Philomin 1999; Viola, Jones, & Snow 2003); the other performs a parts-based detection (Felzenszwalb & Huttenlocher 2000; Ioffe & Forsyth 2001). Leibe, Seemann, & Schiele (2005) presented an image-based people detector using Implicit Shape Models (ISM) with excellent detection results in crowded scenes.

Existing people detection methods based on camera and laser rangefinder data either use hard-constrained approaches or hand-tuned thresholding. Cui et al. (2005) use multiple laser scanners at foot height and a monocular camera to obtain people tracking by extracting feet and step candidates. Zivkovic & Kröse (2007) use a learned leg detector and boosted Haar features extracted from the camera images to merge this information into a parts-based method.
However, both the proposed approach to cluster the laser data using Canny edge detection and the extraction of Haar features to detect body parts are hardly suited for outdoor scenarios, due to the highly cluttered data and the larger variation of illumination encountered there. Therefore, we use an improved clustering method for the laser scans and SIFT features for the image-based detector. Schulz (2006) uses probabilistic exemplar models learned from training data of both sensors and applies a Rao-Blackwellized particle filter (RBPF) in order to track the person's appearance in the data. The RBPF tracks contours in the image based on Chamfer matching as well as point clusters in the laser scan, and computes the likelihood of different prototypical shapes in the data. However, in outdoor scenarios lighting conditions change frequently and occlusions are very likely, which is why contour matching is not appropriate. Moreover, the RBPF is computationally demanding, especially in crowded environments.

Several methods have been proposed to track moving objects in sequential data (see Cox (1993) for an overview). The most common ones include the joint likelihood filter (JLF), the joint probabilistic data association filter (JPDAF), and the multiple hypothesis filter (MHF). Unfortunately, the exponential complexity of these methods makes them inappropriate for real-time applications such as navigation and path planning. Cox & Miller (1995) approximate the MHF and JPDAF methods by applying Murty's algorithm and demonstrate in simulations the resulting speedup for the MHF method. Rasmussen & Hager (2001) extend the JLF, JPDAF, and MHF algorithms to track objects represented by complex feature combinations. Schumitsch et al. (2006) propose a method to reduce the complexity of MHF methods by introducing the Identity Management Kalman Filter (IMKF) for entities with signature.

Overview of the Method

Our system is divided into three phases: training, detection, and tracking (see Fig. 1). In the training phase, the system learns a structure-based classifier from a hand-labeled set of 2D laser range scans, and an appearance-based classifier from a set of labeled camera images. The former uses a boosted cascade of linear SVMs, while the latter computes an ISM, in which a collected set of image descriptors from the training set vote for the occurrence of a person in the test set. In the detection phase, the laser-based classifier is applied to the clusters found in a new range scan, and a probability is computed for each cluster to correspond to a person. The clusters are then projected into the camera image to define a region of interest, from which the appearance-based classifier extracts local image descriptors and computes a set of hypotheses of detected persons. Here, we apply a new technique to discard false positive detections. Finally, in the tracking phase, the information from both classifiers is used to track the position of the people in the scan data. The tracker is updated whenever a new image or laser measurement is received and processed. It applies several motion models per track to account for the high variety of possible motions a person can perform. In the following, we describe the particular steps of our system in detail.

Structure Information from Laser Data Analysis

We assume that the robot is equipped with a laser range sensor that provides 2D scan points (x1, ..., xN) in the laser plane.
We detect a person in a range scan by first clustering the data and then applying a boosted classifier to the clusters, as we describe in the following.
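As a minimal sketch of this processing chain (using a simple jump-distance criterion in place of the graph-based clustering, an illustrative subset of geometrical features, and a hypothetical classifier interface), clustering and classification could look as follows:

```python
import numpy as np

def cluster_scan(points, jump_dist=0.2):
    """Group consecutive 2D scan points into clusters whenever the gap between
    neighbors exceeds jump_dist (a simple stand-in for the graph-based
    clustering used in the paper). points: (N, 2) array of scan points."""
    if len(points) == 0:
        return []
    clusters, current = [], [points[0]]
    for prev, curr in zip(points[:-1], points[1:]):
        if np.linalg.norm(curr - prev) > jump_dist:
            clusters.append(np.array(current))
            current = [curr]
        else:
            current.append(curr)
    clusters.append(np.array(current))
    return clusters

def cluster_features(cluster):
    """A few simple geometrical features per cluster (illustrative subset:
    number of points, cluster width, and the spread of points around the
    centroid)."""
    centroid = cluster.mean(axis=0)
    d = np.linalg.norm(cluster - centroid, axis=1)
    width = np.linalg.norm(cluster[-1] - cluster[0])
    return np.array([len(cluster), width, d.mean(), d.std()])

def detect_person_clusters(points, classifier, min_points=3):
    """Return (cluster, probability) pairs for clusters the boosted classifier
    considers person-like. `classifier` is any object exposing a
    predict_proba(features) -> probability method (hypothetical interface)."""
    detections = []
    for cluster in cluster_scan(points):
        if len(cluster) < min_points:
            continue
        p = classifier.predict_proba(cluster_features(cluster))
        detections.append((cluster, p))
    return detections
```

The per-cluster probabilities produced in this way are what is later projected into the camera image to define regions of interest for the appearance-based detector.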
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008